Multivariate analysis methods improve the selection of strawberry genotypes with low cold requirement

Multivariate analysis is a powerful approach to assist the initial stages of crop genetic improvement, particularly because it allows many traits to be evaluated simultaneously. In this study, heat-tolerant genotypes were selected by analyzing phenotypic diversity, direct and indirect relationships among traits were identified, and four selection indices were compared. Diversity was estimated using K-means clustering, with the number of clusters determined by the Elbow method, and the relationships among traits were quantified by path analysis. Parametric and non-parametric indices were applied to select genotypes using the magnitude of genotypic variance, heritability, genotypic coefficient of variation, and assigned economic weight as selection criteria. The variability among materials led to the formation of two non-overlapping clusters containing 40 and 154 genotypes. Strong to moderate correlations were found between traits, with a direct effect of the number of commercial fruits on the mass of commercial fruits. The Smith and Hazel index showed the greatest total gains for all criteria; however, for the biochemical traits, the Mulamba and Mock index showed the highest predicted gains. Overall, K-means clustering, correlation analysis, and path analysis complement the use of selection indices, allowing the selection of genotypes with a better balance among the assessed traits.


Results
The optimal K value for the population was determined to be 2 according to the Elbow method (Fig. 1). K-means clustering generated two non-overlapping clusters (Fig. 2). The control 'Camino Real' and 40 seedling genotypes were in group 1. The control 'Camarosa' and 154 seedling genotypes were in group 2. According to the confidence intervals of the means (p < 0.05), group 1 was superior to group 2 for the yield-related traits mass of commercial fruits (MCF), number of commercial fruits (NCF), and average mass of commercial fruits (AMCF), as well as for the biochemical characteristics Ratio, ascorbic acid (AA), and anthocyanins (ANT). In contrast, no significant difference in total pectin (TP) was found between groups 1 and 2 (Table 1). Twelve significant and positive correlations were found by the t-test (p < 0.05) among the 21 pairs of traits evaluated (Fig. 3). The most robust correlations were obtained for yield-related characteristics. A high phenotypic correlation (r > 0.66) was measured between MCF and NCF (r = 0.96). Medium correlations (0.33 < r < 0.67) were measured between MCF and AMCF, MCF and Ratio, and NCF and Ratio, with r values of 0.55, 0.53, and 0.53, respectively. Low correlations (r < 0.34) were measured between NCF and AMCF (r = 0.34), between yield- and biochemical-related traits (MCF and AA (r = 0.32); MCF and ANT (r = 0.29); NCF and AA (r = 0.34); NCF and ANT (r = 0.32); and AMCF and Ratio (r = 0.27)), and between the biochemical traits Ratio and AA (r = 0.18) and Ratio and ANT (r = 0.32).
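The elbow criterion mentioned at the start of this section can be illustrated with a minimal sketch. The eight two-trait points are hypothetical, the seeding rule is a simplification chosen for reproducibility, and the second-difference rule is only one crude numerical reading of the elbow; the study itself reads the bend from Fig. 1 and does not report its implementation in this section.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic start; returns the
    within-cluster sum of squares (WCSS) for k clusters."""
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]  # evenly spaced rows as seeds
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if updated == centers:  # assignments no longer change the centers
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical standardized values for two traits (not data from the study).
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
    (3.0, 3.1), (3.2, 2.9), (2.9, 3.3), (3.1, 3.0),
]

ks = [1, 2, 3, 4]
wcss = [kmeans_wcss(points, k) for k in ks]

# Second difference of the WCSS curve; its maximum marks the sharpest bend.
bend = {ks[i]: wcss[i - 1] - 2 * wcss[i] + wcss[i + 1] for i in range(1, len(ks) - 1)}
elbow_k = max(bend, key=bend.get)

for k, w in zip(ks, wcss):
    print(f"k = {k}: WCSS = {w:.4f}")
print("elbow at k =", elbow_k)
```

With real data, several random starts per K would normally be used, because Lloyd's algorithm can stop at a local optimum.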
By unfolding the correlations through path analysis for a single causal diagram, the direct and indirect effects of the other characteristics on the main (dependent) trait MCF were identified (Table 2). Biochemical-related traits had no relevant direct or indirect effect on MCF. NCF had a direct effect on MCF (0.88) and an indirect effect via AMCF (0.086). In contrast, AMCF had an indirect effect via NCF (0.30) greater than its own direct effect (0.25). Ratio had a small direct effect on MCF but greater indirect effects via NCF (0.46) and AMCF (0.068).
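The decomposition described above can be sketched numerically. In path analysis, the direct effects p solve the system R_xx p = r_xy, and the indirect effect of trait i via trait j is r_ij × p_j. The sketch below is an illustration under stated assumptions: it uses only three explanatory traits (NCF, AMCF, Ratio) and the rounded correlations reported in the text, whereas the study's analysis included all evaluated traits, so the values only approximate Table 2. The helper `solve_linear` is a hypothetical name introduced here.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

traits = ["NCF", "AMCF", "Ratio"]

# Phenotypic correlations among the explanatory traits (rounded, from the text).
r_xx = [
    [1.00, 0.34, 0.53],
    [0.34, 1.00, 0.27],
    [0.53, 0.27, 1.00],
]
# Correlation of each explanatory trait with MCF (rounded, from the text).
r_xy = [0.96, 0.55, 0.53]

direct = solve_linear(r_xx, r_xy)  # path coefficients (direct effects on MCF)

def indirect(i, j):
    """Indirect effect of trait i on MCF via trait j: r_ij * p_j."""
    return r_xx[i][j] * direct[j]

for i, name in enumerate(traits):
    parts = ", ".join(
        f"via {traits[j]}: {indirect(i, j):.3f}" for j in range(len(traits)) if j != i
    )
    print(f"{name}: direct {direct[i]:.3f}; {parts}")
```

For each trait, the direct effect plus its indirect effects adds back up to its correlation with MCF, which is the sense in which a correlation is "unfolded".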
The total percentage gains obtained by the four indices under the four criteria (GV, h², GCV, and EW) ranged from 366 to 386% (Table 3) for simultaneous selection of yield and biochemical traits. The Smith-Hazel index showed the highest total gains, with 386% for all criteria, followed by the Genotype-Ideotype distance index with 384% for GCV, the Mulamba and Mock index with 384% and 383% for EW and GCV, respectively, and by the [...] (Table 3). Regarding yield and biochemical traits, the Mulamba and Mock index provided the greatest gains for the biochemical-related traits under h² (86%) and GCV (81%), followed by the Genotype-Ideotype index under h² (79%) and GCV (75%). The Smith and Hazel index, despite showing the greatest gains for yield traits, showed the lowest gains for biochemical traits. The four indices under the four criteria selected 53 genotypes (Table 4). Of this total, 38 are located in group 1 and 15 in group 2, according to the K-means clustering. A total of 28 genotypes were selected by all indices for all criteria, of which only one belonged to group 2 of the K-means cluster analysis. Eleven genotypes were selected by some indices for all criteria and by other indices for only some criteria. Another eleven genotypes were selected by some indices for some criteria, and three genotypes were selected by some indices for all criteria.

Table 1. Confidence intervals of the mean values of the variables evaluated in the two K-means clusters, with the respective number of genotypes (p < 0.05). MCF, mass of commercial fruits (g plant⁻¹); AMCF, average mass of commercial fruits (g fruit⁻¹); NCF, number of commercial fruits (fruits plant⁻¹); Ratio, soluble solids (°Brix)/titratable acidity (g citric acid 100 g⁻¹ pulp); TP, total pectin (g total pectin 100 g⁻¹ pulp); AA, ascorbic acid (mg ascorbic acid 100 g⁻¹ pulp); ANT, anthocyanins (mg cyanidin-3-glucoside 100 g⁻¹ pulp).
The crosses 'Camarosa' × 'Aromas' and 'Camarosa' × 'Sweet Charlie' stood out, with 14 and 9 selected hybrids, respectively. The other crosses showed the following numbers of selected hybrids: 'Dover' × 'Aromas', 6; 'Oso Grande' × 'Tudla', 5; 'Festival' × 'Aromas', 5; 'Aromas' × 'Sweet Charlie', 3; 'Tudla' × 'Aromas', 3; 'Dover' × 'Sweet Charlie', 3; 'Tudla' × 'Sweet Charlie', 3; and 'Festival' × 'Sweet Charlie', 2.

Discussion
Brazilian strawberry production depends almost entirely on cultivars developed in foreign breeding programs that, due to genotype × environment interactions, may present lower yield, lower biochemical quality, and greater susceptibility to pests and diseases, which increases production costs 5. Nonetheless, these imported cultivars have the potential to be explored in intraspecific crosses aiming to express the existing variability in the species 13,27. Strawberry is an octoploid species that has gone through various levels of ploidization throughout its evolutionary history 28. Strawberry also harbors millions of DNA variants across the subgenomes of the species that gave rise to the present-day cultivated strawberry 29. In general, hybrids obtained from strawberry crosses present great variability, which favors the selection of new cultivars 30. Significant variability, with identification of superior hybrids, has been found in phenotypic analyses of yield and physicochemical traits in populations obtained from crosses between commercial strawberry cultivars in Brazil 9,12,13,31-33. In addition, genetic studies of hybrids and commercial cultivars based on molecular markers have shown that the germplasm of the Brazilian strawberry breeding program has genetic variability and divergence; therefore, it has a high potential for launching new cultivars 9,34.
For the population analyzed in this study, the Elbow method established two clusters, which presented no overlap in the K-means clustering, showing variability in the analyzed population and complete dissimilarity between the two groups formed. The highest phenotypic correlation with the main (dependent) variable mass of commercial fruits (MCF) was obtained for the number of commercial fruits (NCF) (0.96), which had a high direct effect (0.88) in the path analysis. The average mass of commercial fruits (AMCF), which also had a medium and positive correlation (0.55) with MCF, showed in the path analysis an indirect effect via NCF (0.29) greater than its direct effect (0.25). Diel et al. 35 found a direct effect of the total number of fruits (0.81) and an indirect effect of the mass of commercial fruits via the total number of fruits (0.71), while the average fruit mass showed a direct relationship of 0.22. These authors' results corroborate our study, and these findings suggest that direct selection via the number of commercial fruits has a greater effect on yield and indirectly benefits the average mass of commercial fruits.

Table 3. Estimates of percentage gains obtained by simultaneous selection with application of four indices based on four criteria of economic weights for seven traits evaluated in 10 populations of Fragaria × ananassa Duch. MCF, mass of commercial fruits (g plant⁻¹); AMCF, average mass of commercial fruits (g fruit⁻¹); NCF, number of commercial fruits (fruits plant⁻¹); Ratio, soluble solids (°Brix)/titratable acidity (g citric acid 100 g⁻¹ pulp); TP, total pectin (g total pectin 100 g⁻¹ pulp); AA, ascorbic acid (mg ascorbic acid 100 g⁻¹ pulp); ANT, anthocyanins (mg cyanidin-3-glucoside 100 g⁻¹ pulp); GV, genotypic variance; h², heritability; GCV, genotypic coefficient of variation; EW, economic weight assigned by the breeder.
The balance between soluble solids and titratable acidity (Ratio) represents the equilibrium between sweetness and acidity. This balance, combined with aroma and other biochemical traits, makes up flavor, which has great importance in sensory perception and consumer preference 5,6. In the present study, Ratio showed a moderate and positive phenotypic correlation with the mass of commercial fruits (0.52) and the number of commercial fruits (0.53); however, when this correlation was unfolded, a negative direct effect was observed, while the indirect effect via NCF was positive. In agreement with the present study, Diel et al. 35 found a negative direct effect (−0.10) and a positive indirect effect (0.15) of Ratio via the total number of fruits on the total fruit mass. Direct effects of the number of strawberry fruits on production per plant were also reported by Ara et al. 36 and Garg 11, while Singh et al. 37 stated that the greatest direct positive effects came from flower number and fruit length. These results indicate that the selection of strawberry genotypes for mass of commercial fruits can be performed directly via the number of commercial fruits and that genotypes with numerous fruits of medium size tend to have a better Ratio than genotypes with large fruits.
Selecting genotypes that balance yield and biochemical traits simultaneously is a complex task 10 . The use of selection indices, both parametric and non-parametric, has been useful to identify more balanced hybrids of diverse crops, such as sweet potato 26 , alfalfa 38 , soybean 25,39,40 , potato 41 , maize 42 , acai 43 , passion fruit 23,44 , and, more recently, strawberry 13,27 .
In the present study, the Mulamba and Mock and Genotype-Ideotype indices were more sensitive to the use of different criteria, showing greater differences between gains. Cruz et al. 45 recommend using statistics obtained from the analysis of experimental data as economic weights (EW), since they relate to the genotypic variance, are dimensionless, and maintain a certain proportionality among the evaluated traits. In the present study, the greatest gains for yield traits were obtained by the Smith and Hazel index (330.14%); however, it showed no difference between the statistical criteria or assigned weights. In contrast, the greatest gains for the biochemical-related traits were obtained by the Mulamba and Mock and Genotype-Ideotype indices under the criteria h² and GCV (86.34% and 81.06% for Mulamba and Mock; 79.41% and 74.87% for Genotype-Ideotype). Vieira et al. 27, evaluating strawberry genotypes, also reported the greatest increments for yield traits with the Smith and Hazel index and for biochemical characteristics with the Mulamba and Mock index. This occurs because parametric tests use the distribution parameters to calculate the statistics, while non-parametric tests use ranks assigned to ordered data and are not influenced by the probability distribution of the data evaluated 46. Thus, the non-parametric Mulamba and Mock index is mathematically less sensitive to traits that present wide variance, such as number of fruits.
Of the 194 genotypes analyzed, 28 were selected by all indices under all criteria, of which 27 belong to group 1 of the K-means clustering. The different indices and criteria tend to give very similar results for the top-ranked genotypes. Bernardo et al. 47 analyzed studies in several agronomic crops and concluded that, if the population is large enough, any selection index applied judiciously is useful for the simultaneous improvement of multiple traits, regardless of the method used. Nevertheless, further down the ranking, the indices start to select different hybrids under the different criteria.
The crosses with the highest number of selected hybrids were 'Camarosa' × 'Aromas' and 'Camarosa' × 'Sweet Charlie'. Similarly, Galvão et al. 28 identified the best hybrids for yield traits in the cross 'Camarosa' × 'Aromas'. 'Camarosa' has been reported as a highly productive cultivar, with large, firm, and tasty fruits 48, and is one of the most planted short-day cultivars in the world 49. The presence of a large number of favorable alleles in 'Camarosa' and 'Aromas' 33 and their high productive potential 50,51 make them promising parents for strawberry breeding programs 5. Camargo et al. 32 also found and selected the best hybrids for biochemical traits from the crosses 'Camarosa' × 'Aromas' and 'Camarosa' × 'Sweet Charlie'.
The dendrogram generated from the 53 selected genotypes led to the formation of five groups, demonstrating that this population still has variability that can be further investigated.

Conclusion
K-means clustering, correlation analysis, and path analysis complement the use of selection indices, leading to the selection of hybrids with a better balance between yield- and biochemical-related traits in strawberry. This combined approach is more promising than direct selection based on only one or a few traits. Furthermore, the multivariate analysis methods were efficient in selecting strawberry genotypes for multiple traits.
The number of commercial fruits was more relevant to the mass of commercial fruits than the average mass of commercial fruits. Therefore, NCF is a trait of greater importance for the selection of strawberry genotypes aiming at yield. The Smith and Hazel index showed the greatest gain for yield traits, possibly because it is mathematically more influenced by characteristics with greater variability, such as yield. The Mulamba and Mock and Genotype-Ideotype indices, both non-parametric, showed the highest estimated gains for biochemical traits under the criteria h² and GCV. The crosses with the highest number of selected hybrids were 'Camarosa' × 'Aromas' and 'Camarosa' × 'Sweet Charlie'. The selected population of 53 hybrids still has variability with potential to be exploited.

Material and methods
This study was performed in accordance with the relevant guidelines and regulations. Plant material and replications followed the regulations of the Ministry of Agriculture, Livestock and Supply of Brazil.
Plant material. Ten populations were obtained from biparental crosses among strawberry cultivars traditionally grown in South America [...] cultivars based on photoperiod responses, except for 'Aromas', which is a day-neutral cultivar 52. Hybridization was performed following Chandler et al. 53. The choice of cultivars for the crosses used to obtain segregating populations was based on the genetic dissimilarity study carried out by Morales et al. 34. After crossing, the achenes present in the fruits were removed and germinated in vitro, as described by Galvão et al. 31. At 60 days after germination, the seedlings were transplanted to 72-cell polypropylene trays containing biostabilized substrate. Strawberry transplanting was performed in a low-tunnel system 0.8 m high, with beds 1 m wide and 0.25 m high whose surface was covered with black polyethylene film 30 µm thick. Transparent polyethylene film 120 µm thick was used to cover the tunnels. The plant spacing was 0.30 × 0.30 m between plants and 0.40 m between rows.
Beds were fertilized with 1,650 kg ha −1 of simple superphosphate, 250 kg ha −1 of potassium chloride, and 295 kg ha −1 of urea, based on the soil chemical analysis in accordance with the recommendations for the strawberry crop 52 . Nutritional replacement was performed via fertigation twice a week. Irrigation water was provided using a micro-drip system and followed the crop water demand. Additionally, for phytosanitary preventive control, applications of Thiamethoxam and Azoxystrobin + Difenoconazole were carried out. Strawberry fruits were harvested at maturity stage when 75% of fruit were red.
The experiment was conducted in a randomized block design with three replications and ten plants per plot. There was a total of 194 F1 experimental hybrids and two commercial controls ('Camarosa' and 'Camino Real').

Yield and biochemical traits evaluated.
Traits that showed significant differences in the analysis of variance were used in the further analyses, namely: mass of commercial fruits (MCF) (g plant⁻¹), number of commercial fruits (NCF) (fruits plant⁻¹), average mass of commercial fruits (AMCF) (g fruit⁻¹), ratio between soluble solids (SS) (°Brix) and titratable acidity (TA) (g citric acid 100 g⁻¹ pulp) (Ratio), total pectin (TP) (g total pectin 100 g⁻¹ pulp), ascorbic acid content (AA) (mg ascorbic acid 100 g⁻¹ pulp), and anthocyanin content (ANT) (mg cyanidin-3-glucoside 100 g⁻¹ pulp). The biochemical traits were assessed in samples of commercial ripe strawberries (above 10 g), stored at −2 °C right after harvest. Strawberries were thawed, crushed, and homogenized. Using the homogenized pulp, soluble solids content was measured with an Optech bench refractometer. Titratable acidity was determined by titration of aliquots of 10 g of strawberry pulp plus 100 mL of distilled water with 0.1 mol L⁻¹ NaOH standard solution up to pH 8.2, which corresponds to the turning point of phenolphthalein 56. Total pectin was obtained by the method described by McCready and McComb 57 and determined colorimetrically using the carbazole reaction, according to the methodology described by Bitter and Muir 58. Ascorbic acid was obtained by the standard titration method of the Association of Official Analytical Chemists (AOAC), modified by Benassi and Antunes 59. Whereas anthocyanin was deter- [...]

[...] 69,70 to estimate genotypic variance (GV), heritability (h²), and genotypic coefficient of variation (GCV). Economic weights (EW) were assigned (Table 6). Subsequently, two parametric indices, the classic index from Smith 17 and Hazel 18 and the base index 19, and two non-parametric indices, the rank-sum-based index 20 and the genotype-ideotype distance index 21, were used to select genotypes.
The genotypic aggregate (H) in the classic Smith-Hazel index is obtained by the expression $H = a_1 g_1 + a_2 g_2 + \cdots + a_n g_n$, where "a" is the n × 1 vector of the economic weights and "g" is the p × n matrix of unknown genetic values of the "n" traits for the "p" families or progenies evaluated. The index (I) consists of a linear combination of the "x" values measured for each trait, weighted by a coefficient. It is obtained by the expression $I = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$, where the coefficient "b" is an n × 1 vector estimated from the expression $b = P^{-1} G a$, in which $P^{-1}$ is the inverse of the phenotypic covariance matrix, "G" is the genetic covariance matrix, and "a" is the n × 1 vector of the economic weights assigned to the traits 17,18.
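A minimal numerical sketch of the expressions above, for two traits, may help fix the notation. All matrices, weights, and phenotypic values are hypothetical and chosen only so that the arithmetic is easy to follow; the helper names `inv2` and `matvec` are introduced here for illustration and are not part of the study's workflow.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

# Hypothetical phenotypic (P) and genetic (G) covariance matrices for two traits.
P = [[4.0, 1.0], [1.0, 2.0]]
G = [[2.0, 0.5], [0.5, 1.0]]
a = [1.0, 1.0]  # hypothetical economic weights

# Index coefficients: b = P^{-1} G a
b = matvec(inv2(P), matvec(G, a))

# Hypothetical phenotypic values x for three genotypes and two traits.
X = {"g1": [10.0, 3.0], "g2": [8.0, 4.0], "g3": [12.0, 2.0]}

# Index value I = b1*x1 + b2*x2 for each genotype; higher is better.
index = {g: sum(bi * xi for bi, xi in zip(b, x)) for g, x in X.items()}
ranking = sorted(index, key=index.get, reverse=True)

print("b =", [round(v, 3) for v in b])
print("ranking:", ranking)
```

Because b depends on the covariance matrices, traits with large genetic variance tend to dominate the index unless the economic weights compensate for it.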
The rank-sum index of Mulamba and Mock is obtained by the expression $I = r_1 + r_2 + \cdots + r_n$, where "I" is the index value for a given individual, $r_j$ is the rank of the individual for the j-th variable, and "n" is the number of traits considered in the index. This procedure also allows the ranks of the traits to have different weights, as specified by the breeder. In that case, $I = p_1 r_1 + p_2 r_2 + \cdots + p_n r_n$, with $p_j$ being the economic weight attributed by the breeder to the j-th trait 20.
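The weighted rank sum can be sketched in a few lines. The genotype means and weights below are hypothetical, and the convention that rank 1 is the most favorable value (so the lowest sum is preferred) is an assumption of this sketch; the source does not state the ranking direction.

```python
def ranks(values, higher_is_better=True):
    """Rank 1 = most favorable value. Tied values share the same rank."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]

genotypes = ["g1", "g2", "g3", "g4"]

# Hypothetical genotype means for two traits (not data from the study).
data = {
    "MCF": [350.0, 420.0, 300.0, 390.0],
    "Ratio": [9.5, 8.0, 10.2, 9.9],
}
weights = {"MCF": 1.0, "Ratio": 1.0}  # hypothetical economic weights p_j

trait_ranks = {trait: ranks(values) for trait, values in data.items()}

# I = p_1 r_1 + p_2 r_2 + ... + p_n r_n for each genotype.
index = {
    g: sum(weights[t] * trait_ranks[t][i] for t in data)
    for i, g in enumerate(genotypes)
}
best = min(index, key=index.get)  # lowest rank sum under this convention

print(index)
print("best:", best)
```

Only the ordering of the genotypes within each trait enters the index, which is why a trait with very wide variance cannot dominate it the way it can in a parametric index.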
To obtain the genotype-ideotype index, the values that express the distance between the genotypes and the ideotype are calculated by the expression $I_{DGI} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_{ij} - vo_j)^2}$. The best genotypes were identified, and selection gains were estimated based on $I_{DGI}$. Based on the values of the ideotype ($Y_{ij}$), principal components analysis was performed to obtain the eigenvalues and eigenvectors associated with the correlation matrix between the analyzed variables. The distances of the genotypes in relation to the ideotype were then estimated. This process allows the selection of genotypes closer to the optimal pattern defined by the breeder (ideotype) 21.
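The distance expression above can be illustrated with a small sketch. Several points are assumptions rather than statements from the source: $y_{ij}$ is read as the standardized mean of genotype i for trait j, $vo_j$ as the optimal (ideotype) value of trait j on the same scale, the ideotype is taken as the best observed value of each trait, and the principal components step is omitted. All data are hypothetical.

```python
import math

def standardize(column):
    """Center and scale a list of values (sample standard deviation)."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]

def ideotype_distance(y, vo):
    """I_DGI = sqrt((1/n) * sum_j (y_j - vo_j)^2) over the n traits."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, vo)) / n)

genotypes = ["g1", "g2", "g3"]

# Hypothetical genotype means for two traits (not data from the study).
raw = {
    "MCF": [350.0, 420.0, 300.0],
    "Ratio": [9.5, 8.0, 10.2],
}

z = {trait: standardize(values) for trait, values in raw.items()}

# Assumed ideotype: the highest standardized value observed for each trait.
ideotype = [max(z[trait]) for trait in raw]

distance = {
    g: ideotype_distance([z[trait][i] for trait in raw], ideotype)
    for i, g in enumerate(genotypes)
}
closest = min(distance, key=distance.get)  # genotype nearest the ideotype

for g in genotypes:
    print(f"{g}: {distance[g]:.3f}")
print("closest:", closest)
```

In this toy example the genotype that is moderately good for both traits lies closer to the ideotype than either genotype that excels in only one trait, which is the balancing behavior the index is meant to capture.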
Selection gains [SG (%)] in the base index 19 were estimated with the following expression: $SG(\%) = 100\, h^2 (X_s - X_o)/X_o$, where $X_s$ is the average genotypic value of the selected hybrids, $X_o$ is the average genotypic value of all hybrids, and $h^2$ is the heritability of the trait of interest. Heritability was obtained as the ratio between the genotypic and phenotypic variances, $h^2 = \sigma^2_g / \sigma^2_p$, where $\sigma^2_g$ is the genotypic variance and $\sigma^2_p$ is the phenotypic variance 19.

Lastly, the optimal number of clusters was identified with the Dindex criterion of the R package NbClust 71, and a circular hierarchical dendrogram was generated with all selected hybrids and controls, across all parameters and indices, using the R packages vegan v.2.5-6 72 (for the standardization of data), ape v.5.0 73, and cluster v.2.1.0 74.
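As a worked illustration of the selection-gain and heritability expressions given in this section, the following sketch uses hypothetical variance components and means; none of the numbers come from the study.

```python
def heritability(var_genotypic, var_phenotypic):
    """h^2 = genotypic variance / phenotypic variance."""
    return var_genotypic / var_phenotypic

def selection_gain_pct(h2, mean_selected, mean_all):
    """SG(%) = 100 * h^2 * (Xs - Xo) / Xo."""
    return 100.0 * h2 * (mean_selected - mean_all) / mean_all

# Hypothetical values for one trait, e.g. mass of commercial fruits (g per plant).
h2 = heritability(var_genotypic=2.0, var_phenotypic=4.0)
gain = selection_gain_pct(h2, mean_selected=450.0, mean_all=360.0)

print(f"h2 = {h2:.2f}, SG = {gain:.1f}%")
```

The selection differential (Xs − Xo) is scaled by the heritability, so only the heritable fraction of the superiority of the selected hybrids is counted as expected gain.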

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Table 6. Economic weight criteria used in the application of selection indices for trait analysis in 10 populations of Fragaria × ananassa Duch. MCF, mass of commercial fruits (g plant⁻¹); AMCF, average mass of commercial fruits (g fruit⁻¹); NCF, number of commercial fruits (fruits plant⁻¹); Ratio, soluble solids (°Brix)/titratable acidity (g citric acid 100 g⁻¹ pulp); TP, total pectin (g total pectin 100 g⁻¹ pulp); AA, ascorbic acid (mg ascorbic acid 100 g⁻¹ pulp); ANT, anthocyanins (mg cyanidin-3-glucoside 100 g⁻¹ pulp).